GATK3 DepthOfCoverage: Determine coverage at different levels of partitioning and aggregation.

Gatk3DepthOfCoverage · 1 contributor · 2 versions

Overview

This tool processes a set of bam files to determine coverage at different levels of partitioning and aggregation. Coverage can be analyzed per locus, per interval, per gene, or in total; can be partitioned by sample, by read group, by technology, by center, or by library; and can be summarized by mean, median, quartiles, and/or percentage of bases covered to or beyond a threshold. Additionally, reads and bases can be filtered by mapping or base quality score.

Input

One or more bam files (with proper headers) to be analyzed for coverage statistics
(Optional) A REFSEQ file to aggregate coverage to the gene level (for information about creating the REFSEQ Rod, please consult the online documentation)

Output

Tables pertaining to different coverage summaries. The suffix on each table file declares its contents:

no suffix: per locus coverage
_summary: total, mean, median, quartiles, and threshold proportions, aggregated over all bases
_statistics: coverage histograms (# locus with X coverage), aggregated over all bases
_interval_summary: total, mean, median, quartiles, and threshold proportions, aggregated per interval
_interval_statistics: 2x2 table of # of intervals covered to >= X depth in >= Y samples
_gene_summary: total, mean, median, quartiles, and threshold proportions, aggregated per gene
_gene_statistics: 2x2 table of # of genes covered to >= X depth in >= Y samples
_cumulative_coverage_counts: coverage histograms (# locus with >= X coverage), aggregated over all bases
_cumulative_coverage_proportions: proportions of loci with >= X coverage, aggregated over all bases
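
The files this wrapper actually captures are the per-sample variants of these tables (see the Outputs section and the WDL/CWL blocks further down). A minimal sketch of the expected file names for a hypothetical outputPrefix:

# Illustrative only: the seven files captured by this wrapper, built by
# appending the per-sample suffixes to the chosen outputPrefix.
output_prefix = "sample_coverage"  # hypothetical outputPrefix value

suffixes = [
    "",  # per-locus coverage table (no suffix)
    ".sample_summary",
    ".sample_statistics",
    ".sample_interval_summary",
    ".sample_interval_statistics",
    ".sample_cumulative_coverage_counts",
    ".sample_cumulative_coverage_proportions",
]

for name in (output_prefix + s for s in suffixes):
    print(name)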

Quickstart

from janis_core import WorkflowBuilder
from janis_bioinformatics.tools.gatk3.depthofcoverage.versions import GATK3DepthOfCoverage_3_8_1

wf = WorkflowBuilder("myworkflow")

wf.step(
    "gatk3depthofcoverage_step",
    GATK3DepthOfCoverage_3_8_1(
        bam=None,
        reference=None,
        outputPrefix=None,
    )
)
wf.output("sample", source=gatk3depthofcoverage_step.sample)
wf.output("sampleCumulativeCoverageCounts", source=gatk3depthofcoverage_step.sampleCumulativeCoverageCounts)
wf.output("sampleCumulativeCoverageProportions", source=gatk3depthofcoverage_step.sampleCumulativeCoverageProportions)
wf.output("sampleIntervalStatistics", source=gatk3depthofcoverage_step.sampleIntervalStatistics)
wf.output("sampleIntervalSummary", source=gatk3depthofcoverage_step.sampleIntervalSummary)
wf.output("sampleStatistics", source=gatk3depthofcoverage_step.sampleStatistics)
wf.output("sampleSummary", source=gatk3depthofcoverage_step.sampleSummary)

OR

  1. Install Janis
  2. Ensure Janis is configured to work with Docker or Singularity.
  3. Ensure all reference files are available:

Note

More information about these inputs is available below.

  4. Generate user input files for Gatk3DepthOfCoverage:
# user inputs
janis inputs Gatk3DepthOfCoverage > inputs.yaml

inputs.yaml

bam: bam.bam
outputPrefix: <value>
reference: reference.fasta
  5. Run Gatk3DepthOfCoverage with:
janis run [...run options] \
    --inputs inputs.yaml \
    Gatk3DepthOfCoverage

Information

ID: Gatk3DepthOfCoverage
URL: https://github.com/broadinstitute/gatk-docs/blob/master/gatk3-tooldocs/3.8-0/org_broadinstitute_gatk_engine_CommandLineGATK.html
Versions: 3.8-1, 3.8-0
Container: broadinstitute/gatk3:3.8-1
Authors: Jiaan Yu
Citations:
Created: 2020-04-09
Updated: 2020-04-09

Outputs

name type documentation
sample TextFile  
sampleCumulativeCoverageCounts TextFile  
sampleCumulativeCoverageProportions TextFile  
sampleIntervalStatistics TextFile  
sampleIntervalSummary TextFile  
sampleStatistics TextFile  
sampleSummary TextFile  

Additional configuration (inputs)

name type prefix position documentation
bam IndexedBam -I 10 Input file containing sequence data (BAM or CRAM)
reference FastaWithIndexes -R   Reference sequence file
outputPrefix String -o   An output file created by the walker. Will overwrite contents if file exists
intervals Optional<File> -L   One or more genomic intervals over which to operate
excludeIntervals Optional<File> --excludeIntervals   One or more genomic intervals to exclude from processing
argFile Optional<File> --arg_file   Reads arguments from the specified file
showFullBamList Optional<Boolean> --showFullBamList   Emit list of input BAM/CRAM files to log
read_buffer_size Optional<Integer> --read_buffer_size   Number of reads per SAM file to buffer in memory
read_filter Optional<Boolean> --read_filter   Filters to apply to reads before analysis
disable_read_filter Optional<Boolean> --disable_read_filter   Read filters to disable
interval_set_rule Optional<String> --interval_set_rule   Set merging approach to use for combining interval inputs (UNION|INTERSECTION)
interval_merging Optional<String> --interval_merging   Set merging approach to use for combining interval inputs (UNION|INTERSECTION)
interval_padding Optional<Integer> --interval_padding   Amount of padding (in bp) to add to each interval
nonDeterministicRandomSeed Optional<Boolean> --nonDeterministicRandomSeed   Use a non-deterministic random seed
maxRuntime Optional<String> --maxRuntime   Unit of time used by maxRuntime (NANOSECONDS|MICROSECONDS|SECONDS|MINUTES|HOURS|DAYS)
downsampling_type Optional<String> --downsampling_type   Type of read downsampling to employ at a given locus (NONE|ALL_READS|BY_SAMPLE)
downsample_to_fraction Optional<Float> --downsample_to_fraction   Fraction of reads to downsample to
baq Optional<String> --baq   Type of BAQ calculation to apply in the engine (OFF|CALCULATE_AS_NECESSARY|RECALCULATE)
refactor_NDN_cigar_string Optional<Boolean> --refactor_NDN_cigar_string   Reduce NDN elements in CIGAR string
fixMisencodedQuals Optional<Boolean> --fixMisencodedQuals   Fix mis-encoded base quality scores
allowPotentiallyMisencodedQuals Optional<Boolean> --allowPotentiallyMisencodedQuals   Ignore warnings about base quality score encoding
useOriginalQualities Optional<Boolean> --useOriginalQualities   Use the base quality scores from the OQ tag
defaultBaseQualities Optional<Integer> --defaultBaseQualities   Assign a default base quality
performanceLog Optional<Filename> --performanceLog   Write GATK runtime performance log to this file
BQSR Optional<File> --BQSR   Input covariates table file for on-the-fly base quality score recalibration
disable_indel_quals Optional<Boolean> --disable_indel_quals   Disable printing of base insertion and deletion tags (with -BQSR)
emit_original_quals Optional<Boolean> --emit_original_quals   Emit the OQ tag with the original base qualities (with -BQSR)
preserve_qscores_less_than Optional<Integer> --preserve_qscores_less_than   Don't recalibrate bases with quality scores less than this threshold (with -BQSR)
countType Optional<String> --countType   How should overlapping reads from the same fragment be handled? (COUNT_READS|COUNT_FRAGMENTS|COUNT_FRAGMENTS_REQUIRE_SAME_BASE)
summaryCoverageThreshold Optional<Array<Integer>> -ct   Coverage threshold (in percent) for summarizing statistics
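
As a sketch of how these optional inputs are supplied through the Python API, the quickstart step can be extended with extra keyword arguments; the prefix and threshold values below are illustrative only, and any input from the table above can be passed the same way:

from janis_core import WorkflowBuilder
from janis_bioinformatics.tools.gatk3.depthofcoverage.versions import GATK3DepthOfCoverage_3_8_1

wf = WorkflowBuilder("myworkflow")

# Sketch only: optional inputs are extra keyword arguments on the tool call.
wf.step(
    "gatk3depthofcoverage_step",
    GATK3DepthOfCoverage_3_8_1(
        bam=None,                              # required IndexedBam (-I)
        reference=None,                        # required FastaWithIndexes (-R)
        outputPrefix="sample_coverage",        # hypothetical prefix (-o)
        intervals=None,                        # optional interval file (-L)
        summaryCoverageThreshold=[1, 20, 50],  # emitted as repeated -ct flags
    )
)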

Workflow Description Language

version development

task Gatk3DepthOfCoverage {
  input {
    Int? runtime_cpu
    Int? runtime_memory
    Int? runtime_seconds
    Int? runtime_disks
    File bam
    File bam_bai
    File reference
    File reference_fai
    File reference_amb
    File reference_ann
    File reference_bwt
    File reference_pac
    File reference_sa
    File reference_dict
    String outputPrefix
    File? intervals
    File? excludeIntervals
    File? argFile
    Boolean? showFullBamList
    Int? read_buffer_size
    Boolean? read_filter
    Boolean? disable_read_filter
    String? interval_set_rule
    String? interval_merging
    Int? interval_padding
    Boolean? nonDeterministicRandomSeed
    String? maxRuntime
    String? downsampling_type
    Float? downsample_to_fraction
    String? baq
    Boolean? refactor_NDN_cigar_string
    Boolean? fixMisencodedQuals
    Boolean? allowPotentiallyMisencodedQuals
    Boolean? useOriginalQualities
    Int? defaultBaseQualities
    String? performanceLog
    File? BQSR
    Boolean? disable_indel_quals
    Boolean? emit_original_quals
    Int? preserve_qscores_less_than
    String? countType
    Array[Int]? summaryCoverageThreshold
  }
  command <<<
    set -e
    cp -f '~{bam_bai}' $(echo '~{bam}' | sed 's/\.[^.]*$//').bai
    java \
      -Xmx~{((select_first([runtime_memory, 4]) * 3) / 4)}G \
      -jar /usr/GenomeAnalysisTK.jar \
      -T DepthOfCoverage \
      -R '~{reference}' \
      -o '~{outputPrefix}' \
      ~{if defined(intervals) then ("-L '" + intervals + "'") else ""} \
      ~{if defined(excludeIntervals) then ("--excludeIntervals '" + excludeIntervals + "'") else ""} \
      ~{if defined(argFile) then ("--arg_file '" + argFile + "'") else ""} \
      ~{if (defined(showFullBamList) && select_first([showFullBamList])) then "--showFullBamList" else ""} \
      ~{if defined(read_buffer_size) then ("--read_buffer_size " + read_buffer_size) else ''} \
      ~{if (defined(read_filter) && select_first([read_filter])) then "--read_filter" else ""} \
      ~{if (defined(disable_read_filter) && select_first([disable_read_filter])) then "--disable_read_filter" else ""} \
      ~{if defined(interval_set_rule) then ("--interval_set_rule '" + interval_set_rule + "'") else ""} \
      ~{if defined(interval_merging) then ("--interval_merging '" + interval_merging + "'") else ""} \
      ~{if defined(interval_padding) then ("--interval_padding " + interval_padding) else ''} \
      ~{if (defined(nonDeterministicRandomSeed) && select_first([nonDeterministicRandomSeed])) then "--nonDeterministicRandomSeed" else ""} \
      ~{if defined(maxRuntime) then ("--maxRuntime '" + maxRuntime + "'") else ""} \
      ~{if defined(downsampling_type) then ("--downsampling_type '" + downsampling_type + "'") else ""} \
      ~{if defined(downsample_to_fraction) then ("--downsample_to_fraction " + downsample_to_fraction) else ''} \
      ~{if defined(baq) then ("--baq '" + baq + "'") else ""} \
      ~{if (defined(refactor_NDN_cigar_string) && select_first([refactor_NDN_cigar_string])) then "--refactor_NDN_cigar_string" else ""} \
      ~{if (defined(fixMisencodedQuals) && select_first([fixMisencodedQuals])) then "--fixMisencodedQuals" else ""} \
      ~{if (defined(allowPotentiallyMisencodedQuals) && select_first([allowPotentiallyMisencodedQuals])) then "--allowPotentiallyMisencodedQuals" else ""} \
      ~{if (defined(useOriginalQualities) && select_first([useOriginalQualities])) then "--useOriginalQualities" else ""} \
      ~{if defined(defaultBaseQualities) then ("--defaultBaseQualities " + defaultBaseQualities) else ''} \
      --performanceLog '~{select_first([performanceLog, "generated"])}' \
      ~{if defined(BQSR) then ("--BQSR '" + BQSR + "'") else ""} \
      ~{if (defined(disable_indel_quals) && select_first([disable_indel_quals])) then "--disable_indel_quals" else ""} \
      ~{if (defined(emit_original_quals) && select_first([emit_original_quals])) then "--emit_original_quals" else ""} \
      ~{if defined(preserve_qscores_less_than) then ("--preserve_qscores_less_than " + preserve_qscores_less_than) else ''} \
      ~{if defined(countType) then ("--countType '" + countType + "'") else ""} \
      ~{if (defined(summaryCoverageThreshold) && length(select_first([summaryCoverageThreshold])) > 0) then sep(" ", prefix("-ct ", select_first([summaryCoverageThreshold]))) else ""} \
      -I '~{bam}'
  >>>
  runtime {
    cpu: select_first([runtime_cpu, 1])
    disks: "local-disk ~{select_first([runtime_disks, 20])} SSD"
    docker: "broadinstitute/gatk3:3.8-1"
    duration: select_first([runtime_seconds, 86400])
    memory: "~{select_first([runtime_memory, 4])}G"
    preemptible: 2
  }
  output {
    File sample = outputPrefix
    File sampleCumulativeCoverageCounts = (outputPrefix + ".sample_cumulative_coverage_counts")
    File sampleCumulativeCoverageProportions = (outputPrefix + ".sample_cumulative_coverage_proportions")
    File sampleIntervalStatistics = (outputPrefix + ".sample_interval_statistics")
    File sampleIntervalSummary = (outputPrefix + ".sample_interval_summary")
    File sampleStatistics = (outputPrefix + ".sample_statistics")
    File sampleSummary = (outputPrefix + ".sample_summary")
  }
}

Common Workflow Language

#!/usr/bin/env cwl-runner
class: CommandLineTool
cwlVersion: v1.2
label: |-
  GATK3 DepthOfCoverage: Determine coverage at different levels of partitioning and aggregation.
doc: |-
  Overview
  This tool processes a set of bam files to determine coverage at different levels of partitioning and aggregation. Coverage can be analyzed per locus, per interval, per gene, or in total; can be partitioned by sample, by read group, by technology, by center, or by library; and can be summarized by mean, median, quartiles, and/or percentage of bases covered to or beyond a threshold. Additionally, reads and bases can be filtered by mapping or base quality score.

  Input
  One or more bam files (with proper headers) to be analyzed for coverage statistics
  (Optional) A REFSEQ file to aggregate coverage to the gene level (for information about creating the REFSEQ Rod, please consult the online documentation)
  Output
  Tables pertaining to different coverage summaries. Suffix on the table files declares the contents:

  no suffix: per locus coverage
  _summary: total, mean, median, quartiles, and threshold proportions, aggregated over all bases
  _statistics: coverage histograms (# locus with X coverage), aggregated over all bases
  _interval_summary: total, mean, median, quartiles, and threshold proportions, aggregated per interval
  _interval_statistics: 2x2 table of # of intervals covered to >= X depth in >=Y samples
  _gene_summary: total, mean, median, quartiles, and threshold proportions, aggregated per gene
  _gene_statistics: 2x2 table of # of genes covered to >= X depth in >= Y samples
  _cumulative_coverage_counts: coverage histograms (# locus with >= X coverage), aggregated over all bases
  _cumulative_coverage_proportions: proportions of loci with >= X coverage, aggregated over all bases

requirements:
- class: ShellCommandRequirement
- class: InlineJavascriptRequirement
- class: DockerRequirement
  dockerPull: broadinstitute/gatk3:3.8-1

inputs:
- id: bam
  label: bam
  doc: Input file containing sequence  data (BAM or CRAM)
  type: File
  secondaryFiles:
  - |-
    ${

            function resolveSecondary(base, secPattern) {
              if (secPattern[0] == "^") {
                var spl = base.split(".");
                var endIndex = spl.length > 1 ? spl.length - 1 : 1;
                return resolveSecondary(spl.slice(undefined, endIndex).join("."), secPattern.slice(1));
              }
              return base + secPattern
            }

            return [
                    {
                        location: resolveSecondary(self.location, "^.bai"),
                        basename: resolveSecondary(self.basename, ".bai"),
                        class: "File",
                    }
            ];

    }
  inputBinding:
    prefix: -I
    position: 10
- id: reference
  label: reference
  doc: Reference sequence file
  type: File
  secondaryFiles:
  - pattern: .fai
  - pattern: .amb
  - pattern: .ann
  - pattern: .bwt
  - pattern: .pac
  - pattern: .sa
  - pattern: ^.dict
  inputBinding:
    prefix: -R
- id: outputPrefix
  label: outputPrefix
  doc: An output file created by the walker. Will overwrite contents if file exists
  type: string
  inputBinding:
    prefix: -o
- id: intervals
  label: intervals
  doc: One or more genomic intervals over which to operate
  type:
  - File
  - 'null'
  inputBinding:
    prefix: -L
- id: excludeIntervals
  label: excludeIntervals
  doc: One or more genomic intervals to exclude from processing
  type:
  - File
  - 'null'
  inputBinding:
    prefix: --excludeIntervals
- id: argFile
  label: argFile
  doc: Reads arguments from the specified file
  type:
  - File
  - 'null'
  inputBinding:
    prefix: --arg_file
- id: showFullBamList
  label: showFullBamList
  doc: Emit list of input BAM/CRAM files to log
  type:
  - boolean
  - 'null'
  inputBinding:
    prefix: --showFullBamList
- id: read_buffer_size
  label: read_buffer_size
  doc: Number of reads per SAM file to buffer in memory
  type:
  - int
  - 'null'
  inputBinding:
    prefix: --read_buffer_size
- id: read_filter
  label: read_filter
  doc: Filters to apply to reads before analysis
  type:
  - boolean
  - 'null'
  inputBinding:
    prefix: --read_filter
- id: disable_read_filter
  label: disable_read_filter
  doc: Read filters to disable
  type:
  - boolean
  - 'null'
  inputBinding:
    prefix: --disable_read_filter
- id: interval_set_rule
  label: interval_set_rule
  doc: Set merging approach to use for combining interval inputs (UNION|INTERSECTION)
  type:
  - string
  - 'null'
  inputBinding:
    prefix: --interval_set_rule
- id: interval_merging
  label: interval_merging
  doc: Set merging approach to use for combining interval inputs (UNION|INTERSECTION)
  type:
  - string
  - 'null'
  inputBinding:
    prefix: --interval_merging
- id: interval_padding
  label: interval_padding
  doc: Amount of padding (in bp) to add to each interval
  type:
  - int
  - 'null'
  inputBinding:
    prefix: --interval_padding
- id: nonDeterministicRandomSeed
  label: nonDeterministicRandomSeed
  doc: Use a non-deterministic random seed
  type:
  - boolean
  - 'null'
  inputBinding:
    prefix: --nonDeterministicRandomSeed
- id: maxRuntime
  label: maxRuntime
  doc: |-
    Unit of time used by maxRuntime (NANOSECONDS|MICROSECONDS|SECONDS|MINUTES|HOURS|DAYS)
  type:
  - string
  - 'null'
  inputBinding:
    prefix: --maxRuntime
- id: downsampling_type
  label: downsampling_type
  doc: Type of read downsampling to employ at a given locus (NONE|ALL_READS|BY_SAMPLE)
  type:
  - string
  - 'null'
  inputBinding:
    prefix: --downsampling_type
- id: downsample_to_fraction
  label: downsample_to_fraction
  doc: |-
    Fraction of reads to downsample to
  type:
  - float
  - 'null'
  inputBinding:
    prefix: --downsample_to_fraction
- id: baq
  label: baq
  doc: |-
    Type of BAQ calculation to apply in the engine (OFF|CALCULATE_AS_NECESSARY|RECALCULATE)
  type:
  - string
  - 'null'
  inputBinding:
    prefix: --baq
- id: refactor_NDN_cigar_string
  label: refactor_NDN_cigar_string
  doc: Reduce NDN elements in CIGAR string
  type:
  - boolean
  - 'null'
  inputBinding:
    prefix: --refactor_NDN_cigar_string
- id: fixMisencodedQuals
  label: fixMisencodedQuals
  doc: Fix mis-encoded base quality scores
  type:
  - boolean
  - 'null'
  inputBinding:
    prefix: --fixMisencodedQuals
- id: allowPotentiallyMisencodedQuals
  label: allowPotentiallyMisencodedQuals
  doc: Ignore warnings about base quality score encoding
  type:
  - boolean
  - 'null'
  inputBinding:
    prefix: --allowPotentiallyMisencodedQuals
- id: useOriginalQualities
  label: useOriginalQualities
  doc: Use the base quality scores from the OQ tag
  type:
  - boolean
  - 'null'
  inputBinding:
    prefix: --useOriginalQualities
- id: defaultBaseQualities
  label: defaultBaseQualities
  doc: Assign a default base quality
  type:
  - int
  - 'null'
  inputBinding:
    prefix: --defaultBaseQualities
- id: performanceLog
  label: performanceLog
  doc: Write GATK runtime performance log to this file
  type:
  - string
  - 'null'
  default: generated
  inputBinding:
    prefix: --performanceLog
- id: BQSR
  label: BQSR
  doc: Input covariates table file for on-the-fly base quality score recalibration
  type:
  - File
  - 'null'
  inputBinding:
    prefix: --BQSR
- id: disable_indel_quals
  label: disable_indel_quals
  doc: Disable printing of base insertion and deletion tags (with -BQSR)
  type:
  - boolean
  - 'null'
  inputBinding:
    prefix: --disable_indel_quals
- id: emit_original_quals
  label: emit_original_quals
  doc: Emit the OQ tag with the original base qualities (with -BQSR)
  type:
  - boolean
  - 'null'
  inputBinding:
    prefix: --emit_original_quals
- id: preserve_qscores_less_than
  label: preserve_qscores_less_than
  doc: |-
    Don't recalibrate bases with quality scores less than this threshold (with -BQSR)
  type:
  - int
  - 'null'
  inputBinding:
    prefix: --preserve_qscores_less_than
- id: countType
  label: countType
  doc: |-
    How should overlapping reads from the same fragment be handled? (COUNT_READS|COUNT_FRAGMENTS|COUNT_FRAGMENTS_REQUIRE_SAME_BASE)
  type:
  - string
  - 'null'
  inputBinding:
    prefix: --countType
- id: summaryCoverageThreshold
  label: summaryCoverageThreshold
  doc: Coverage threshold (in percent) for summarizing statistics
  type:
  - type: array
    inputBinding:
      prefix: -ct
    items: int
  - 'null'
  inputBinding: {}

outputs:
- id: sample
  label: sample
  doc: ''
  type: File
  outputBinding:
    glob: $(inputs.outputPrefix)
    loadContents: false
- id: sampleCumulativeCoverageCounts
  label: sampleCumulativeCoverageCounts
  doc: ''
  type: File
  outputBinding:
    glob: $((inputs.outputPrefix + ".sample_cumulative_coverage_counts"))
    outputEval: $((inputs.outputPrefix.basename + ".sample_cumulative_coverage_counts"))
    loadContents: false
- id: sampleCumulativeCoverageProportions
  label: sampleCumulativeCoverageProportions
  doc: ''
  type: File
  outputBinding:
    glob: $((inputs.outputPrefix + ".sample_cumulative_coverage_proportions"))
    outputEval: $((inputs.outputPrefix.basename + ".sample_cumulative_coverage_proportions"))
    loadContents: false
- id: sampleIntervalStatistics
  label: sampleIntervalStatistics
  doc: ''
  type: File
  outputBinding:
    glob: $((inputs.outputPrefix + ".sample_interval_statistics"))
    outputEval: $((inputs.outputPrefix.basename + ".sample_interval_statistics"))
    loadContents: false
- id: sampleIntervalSummary
  label: sampleIntervalSummary
  doc: ''
  type: File
  outputBinding:
    glob: $((inputs.outputPrefix + ".sample_interval_summary"))
    outputEval: $((inputs.outputPrefix.basename + ".sample_interval_summary"))
    loadContents: false
- id: sampleStatistics
  label: sampleStatistics
  doc: ''
  type: File
  outputBinding:
    glob: $((inputs.outputPrefix + ".sample_statistics"))
    outputEval: $((inputs.outputPrefix.basename + ".sample_statistics"))
    loadContents: false
- id: sampleSummary
  label: sampleSummary
  doc: ''
  type: File
  outputBinding:
    glob: $((inputs.outputPrefix + ".sample_summary"))
    outputEval: $((inputs.outputPrefix.basename + ".sample_summary"))
    loadContents: false
stdout: _stdout
stderr: _stderr

baseCommand:
- java
arguments:
- position: -3
  valueFrom: |-
    $("-Xmx{memory}G".replace(/\{memory\}/g, (([inputs.runtime_memory, 4].filter(function (inner) { return inner != null })[0] * 3) / 4)))
  shellQuote: false
- position: -2
  valueFrom: -jar /usr/GenomeAnalysisTK.jar
  shellQuote: false
- position: -1
  valueFrom: -T DepthOfCoverage
  shellQuote: false

hints:
- class: ToolTimeLimit
  timelimit: |-
    $([inputs.runtime_seconds, 86400].filter(function (inner) { return inner != null })[0])
id: Gatk3DepthOfCoverage